1 Introduction

The major parts of the following analysis were inspired by the pipeline used in the article Rossi et al., Cell Stem Cell, 2020. The associated code is available on GitHub, in the Cardiac_Gastruloids repository.

The goal of the analysis is to analyze and compare the data obtained in the lab with the data retrieved from the Rossi et al. article. The data used in the Rossi et al. article will be referred to as the “reference data” in the rest of the document. These data were recovered from the GEO website under the accession number GEO: GSE158999. The accessed files are:

  • barcodes.tsv.gz,

  • features.tsv.gz,

  • matrix.mtx.gz,

  • metadata.tsv.gz.

The Rossi et al. article also refers to and uses the in vivo atlas from Pijuan-Sala, Griffiths, Gottgens et al., Nature, 2019. Indeed, in Rossi et al., a classifier was trained on the atlas data. I directly downloaded the classifier from the above-mentioned GitHub repository.

I also downloaded the atlas data of Pijuan-Sala et al. from a terminal, with the following commands:

curl https://content.cruk.cam.ac.uk/jmlab/atlas_data.tar.gz > atlas_data.tar.gz
tar -zxvf atlas_data.tar.gz

2 General settings

2.1 Directories management

To reproduce the analysis, it is recommended to set up all these directories. I assume that all the input files are stored in the same directory, inputData. This directory is divided into 5 subdirectories:

  • 2 of them are dedicated to the raw data to be analyzed: the data from the Marseille Medical Genetics (MMG) laboratory and the reference datasets (zaffranRawData and rossiRawData, respectively),

  • 1 directory contains the data to create an atlas object from Pijuan-Sala et al. (atlas),

  • 1 directory stores the classifier provided by the reference Rossi et al. based on the atlas of Pijuan-Sala et al. (scmap_pijuansala_classifier),

  • 1 directory contains input tables provided by the reference Rossi et al. (InputTables).

basePath <- "XXXXX"  # to be seen as the root for any analysis based on a set of input data
# input paths
atlas.directory <- paste0(basePath, "inputData/atlas/")
classifier.folder <- paste0(basePath, "inputData/scmap_pijuansala_classifier/")
inputTables.folder <- paste0(basePath, "inputData/InputTables/")

# output paths
baseAnalysis <- paste0(basePath, "XXXXX")
if (!dir.exists(baseAnalysis)) {
    dir.create(baseAnalysis)
}
rdsObjects <- paste0(baseAnalysis, "rdsObjects/")
if (!dir.exists(rdsObjects)) {
    dir.create(rdsObjects)
}
atlas.folder <- paste0(baseAnalysis, "atlas/")
if (!dir.exists(atlas.folder)) {
    dir.create(atlas.folder)
}
fig.folder <- paste0(baseAnalysis, "figures/")
if (!dir.exists(fig.folder)) {
    dir.create(fig.folder)
}

2.2 Load custom color tables

To keep the coloring consistent across plots throughout the analysis, I set up a vector of colors.

colors.table <- read.table(file = paste0(inputTables.folder, "ClusterColors.tsv"), sep = "\t", header = T, 
    comment.char = "", as.is = T)
colors.use_ab.initio <- setNames(colors.table$blind_friendly[!is.na(colors.table$ab_initio_identity)], colors.table$ab_initio_identity[!is.na(colors.table$ab_initio_identity)])
colors.use_transferred <- setNames(colors.table$blind_friendly[!is.na(colors.table$transferred_identity)], 
    colors.table$transferred_identity[!is.na(colors.table$transferred_identity)])
colors.use_stages <- setNames(c("#be465c", "#b04b3d", "#c86633", "#d0a63f", "#998138", "#90b03d", "#60a756", 
    "#45c097", "#5e8bd5", "#6d71d8", "#573585", "#bd80d5", "#b853a2", "#ba4b7d"), c("Day_04", "Day_05", "Day_06", 
    "Day_07", "Day_10", "E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5"))

2.3 Custom variables for the analysis

Some of the following functions can use parallelization. Here I configure how parallelization should work.

# Parallelization
plan(strategy = "multisession", workers = 4)
# plan(batchtools_slurm, template =
# '/shared/projects/mothard_in_silico_modeling/slurm_scripts/aa_seuratAn.tmpl')
options(future.globals.maxSize = 62914560000)  #60GB
options(future.seed = 26)

3 Reference data

3.1 Load the reference dataset

The Read10X function only needs the directory in which the input files are stored. It creates an object by interpreting the barcodes, features and matrix files.
The CreateSeuratObject function initializes a Seurat object.

ref.dataset.folder <- paste0(basePath, "inputData/ref_data/")

ref.raw.data <- Read10X(data.dir = ref.dataset.folder, gene = 1)
ref.4d.SO <- CreateSeuratObject(counts = ref.raw.data, project = "recoverAnalysis")  #, assay='RNA', min.cells=3, min.features=100)

# Dimensions of the object
ref.4d.SO
dim(ref.4d.SO)
## An object of class Seurat 
## 23961 features across 30496 samples within 1 assay 
## Active assay: RNA (23961 features)
## [1] 23961 30496

The reference dataset contains 30496 cells, in which 23961 features (genes) were detected.

3.2 Add metadata

The metadata file was created in the Rossi et al. analysis and provided to guide users. I add the stage and Replicate metadata to the Seurat object under the column names day and replicate, respectively.

ref.md <- read.table(file = paste0(ref.dataset.folder, "metadata.tsv.gz"), sep = "\t", header = TRUE, stringsAsFactors = F)
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$stage, col.name = "day")
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$Replicate, col.name = "replicate")

Here is all the metadata information contained in the Seurat object of the reference dataset: str(ref.4d.SO@meta.data)

## 'data.frame':    30496 obs. of  5 variables:
##  $ orig.ident  : Factor w/ 1 level "GAS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nCount_RNA  : num  53183 41938 43631 69730 64691 ...
##  $ nFeature_RNA: int  6676 5225 5869 7090 7039 5568 6208 5684 3450 6052 ...
##  $ day         : chr  "Day4" "Day4" "Day4" "Day4" ...
##  $ replicate   : int  0 0 0 0 0 0 0 0 0 0 ...

There are 5 metadata columns stored in a data frame. Each of them describes the cells. Besides the metadata imported from the external file (day and replicate), there are the nCount_RNA and nFeature_RNA columns:

  • nCount_RNA is the number of reads counted in every droplet with a gel bead associated to a cell,

  • nFeature_RNA is the number of different features (genes) detected in every droplet.
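These two metrics are simple column summaries of the raw count matrix; as a sketch (Seurat computes them automatically when the object is created):

```r
# nCount_RNA: total counts per droplet; nFeature_RNA: number of detected genes
nCount   <- Matrix::colSums(ref.raw.data)
nFeature <- Matrix::colSums(ref.raw.data > 0)
```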

We can get the different values a metadata column can take with the following command. Here is an example for the column named day: unique(ref.4d.SO@meta.data$day)

## [1] "Day4" "Day5" "Day6" "Day7"

3.2.1 Homogenize labels (day, dataset)

For the sake of the analysis, I need to manipulate the labels to make them consistent across all the datasets.

# modify day label
ref.4d.SO@meta.data$day <- gsub("^(Day)([0-9]*)$", "\\1_0\\2", ref.4d.SO@meta.data$day)

# create new metadata named dataset
ref.4d.SO$dataset <- paste0("ref_", tolower(ref.4d.SO@meta.data$day))
Idents(ref.4d.SO) <- ref.4d.SO$dataset

ref_dataset_names <- unique(ref.4d.SO$dataset)

A new metadata column has been created. The values this column can take are ref_day_04, ref_day_05, ref_day_06 and ref_day_07, depending on the origin and the stage (day) of the sequenced cells.

Now here’s what the metadata looks like:

orig.ident nCount_RNA nFeature_RNA day replicate dataset
GAS_Day4_batch4_AAACCCATCGACTCCT GAS 53183 6676 Day_04 0 ref_day_04
GAS_Day4_batch4_AAACGAAGTCATAACC GAS 41938 5225 Day_04 0 ref_day_04
GAS_Day4_batch4_AAACGCTCAAGCGAGT GAS 43631 5869 Day_04 0 ref_day_04
GAS_Day4_batch4_AAACGCTCACCCTAAA GAS 69730 7090 Day_04 0 ref_day_04
GAS_Day4_batch4_AAAGAACAGCCTTGAT GAS 64691 7039 Day_04 0 ref_day_04
GAS_Day4_batch4_AAAGAACAGTTTCGAC GAS 39653 5568 Day_04 0 ref_day_04

3.3 Split into sub-datasets

For the following parts of the analysis, I will split the dataset ref.4d.SO into 4 sub-datasets. I use the SplitObject function from the Seurat package (PROVIDE REFERENCE). Given an attribute (metadata column name), the object is split into a list of subsetted objects.

I split the dataset according to the day the gastruloids were sequenced.

# Split the object
refSO.list <- SplitObject(ref.4d.SO, split.by = "day")

# get dataset names
names(refSO.list) <- ref_dataset_names

# set up the project name of each sub-dataset
for (SO in refSO.list) {
    projName <- unique(SO@meta.data$dataset)
    refSO.list[[projName]]@project.name <- projName
}

Size of the entire dataset: dim(ref.4d.SO)

Dataset Nbr of features Nbr of cells
ref_4_days 23961 30496

Size of sub-datasets:

size_subref <- data.frame(cbind(sapply(refSO.list, function(SO) SO@project.name), t(sapply(refSO.list, dim))))
colnames(size_subref) <- c("Dataset", "Nbr of features", "Nbr of cells")
knitr::kable(size_subref, align = "lrr")
Dataset Nbr of features Nbr of cells
ref_day_04 ref_day_04 23961 6898
ref_day_05 ref_day_05 23961 7783
ref_day_06 ref_day_06 23961 8313
ref_day_07 ref_day_07 23961 7502

4 MMG laboratory data

The lab obtained the scRNAseq data day by day. The sequencing saturation was not good enough for day 5; as a consequence, this day was resequenced. As a result, I have five datasets: one dataset each for days 4, 6 and 11, and two datasets for day 5.

For each dataset, I will execute the following steps:

  • load and interpret the raw data (barcodes, features and matrix files) with the Read10X function, and create a Seurat object (CreateSeuratObject function),

  • add metadata and,

  • rename the cells.

4.1 Load the 5 datasets

There are five sub-datasets to load. I create a Seurat object for each of them and store them in a list. The project name of each sub-dataset is given directly.

# get path to the folders of each sub-dataset
dirs <- list.dirs(paste0(basePath, "inputData/lab_data"))
lab_dirs <- dirs[2:length(dirs)]

# create a list of seurat objects
labSO.list <- list()
lab_dataset_names <- c()
for (x in lab_dirs) {
    dataset.name <- str_extract(x, "[a-z0-9_]*$")
    lab_dataset_names <- append(lab_dataset_names, dataset.name)

    raw.data <- Read10X(data.dir = x, gene = 2)
    SO <- CreateSeuratObject(counts = raw.data, project = dataset.name)  #, assay='RNA', min.cells=3, min.features=100)

    labSO.list[[dataset.name]] <- SO
}

labSO.list
## $lab_day_04
## An object of class Seurat 
## 32286 features across 7324 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_05
## An object of class Seurat 
## 32286 features across 7794 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_05bis
## An object of class Seurat 
## 32286 features across 7729 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_06
## An object of class Seurat 
## 32286 features across 6628 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_11
## An object of class Seurat 
## 32287 features across 9584 samples within 1 assay 
## Active assay: RNA (32287 features)

As with the project names of the reference sub-datasets, the project names of the lab sub-datasets take their values among: lab_day_04, lab_day_05, lab_day_05bis, lab_day_06, lab_day_11

4.2 Add metadata

I need to add metadata, as for the reference dataset. I will add the day the gastruloids were sequenced, the dataset name (built the same way as the reference dataset names) and the replicate.

The replicate metadata is required because there are two datasets sequenced on day 5. It will be useful to differentiate cells between the two datasets.

labSO.list <- lapply(labSO.list, function(SO) {
    dataset <- SO@project.name
    day <- str_to_title(str_extract(dataset, "day_[0-9]*$"))

    SO <- AddMetaData(object = SO, metadata = day, col.name = "day")
    SO <- AddMetaData(object = SO, metadata = dataset, col.name = "dataset")

    if (dataset == "lab_day_05bis") {
        SO <- AddMetaData(SO, metadata = 1, col.name = "replicate")
    } else {
        SO <- AddMetaData(SO, metadata = 0, col.name = "replicate")
    }
    return(SO)
})

knitr::kable(data.frame(head(labSO.list[[1]]@meta.data)), align = "lrrrrrr")

5 Nomenclature for cell names

The cell names follow different nomenclatures in the reference and MMG laboratory sub-datasets. On one hand, the reference cell names are already custom names combining the biological origin (GAS for gastruloids), the day the gastruloids were sequenced, the batch number and the UMI. On the other hand, the laboratory cell names consist only of the UMIs.

The library used by 10X Genomics provides a limited number of unique UMI barcodes (PROVIDE DETAILS). Hence, working with data obtained from multiple sequencing experiments might lead to an overlap of UMI barcodes between datasets.
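A quick way to check for such collisions, before any renaming, is to intersect the raw barcodes of two sub-datasets (an illustrative sketch using two of the lab sub-datasets):

```r
# Barcodes shared between two independently sequenced datasets
shared <- intersect(colnames(labSO.list$lab_day_04), colnames(labSO.list$lab_day_05))
length(shared)  # any value > 0 indicates colliding barcodes
```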

Hence, I decide to create a nomenclature to rename all the cells under a unique identifier. Thus, each cell will be identified by

  • ref or lab, depending on whether the data come from Rossi et al. or the MMG laboratory,

  • the dataset value previously added to the metadata,

  • the replicate value, also present in the metadata and

  • the UMI provided by 10x Genomics.

rename_cells <- function(SO) {
    cell_ids <- colnames(SO)
    UMIs <- str_extract(cell_ids, "[A-Z]*$")
    cellnames <- paste(SO$dataset, SO$replicate, UMIs, sep = "_")

    SO <- RenameCells(SO, new.names = cellnames)
    return(SO)
}

refSO.list <- lapply(refSO.list, rename_cells)
labSO.list <- lapply(labSO.list, rename_cells)
print(colnames(refSO.list$ref_day_04)[1])
print(colnames(labSO.list$lab_day_04)[1])
## [1] "ref_day_04_0_AAACCCATCGACTCCT"
## [1] "lab_day_04_0_AAACCCAAGTACAGAT"

We can check that there are as many unique cell identifiers as cells across all the sub-datasets.

nbCells <- Reduce("+", lapply(refSO.list, function(SO) dim(SO)[2])) + Reduce("+", lapply(labSO.list, function(SO) dim(SO)[2]))

nbIDs <- Reduce("+", lapply(refSO.list, function(SO) length(unique(colnames(SO))))) + Reduce("+", lapply(labSO.list, 
    function(SO) length(unique(colnames(SO)))))

print(paste("Nbr of cells among all the sub-datasets:", nbCells))
print(paste("Nbr of cell identifiers:", nbIDs))
## [1] "Nbr of cells among all the sub-datasets: 69555"
## [1] "Nbr of cell identifiers: 69555"

Now, all the sub-datasets are prepared for the analysis. First, I will proceed to the quality control of the cells, after which I will run the standard Seurat preprocessing workflow. Then, I will remove the doublets using the DoubletFinder library (PROVIDE SOURCE). Finally, I will investigate whether there are sources of unwanted variation among the replicates.

First and foremost, I gather all the data into a common object: the sub-datasets are stored in a list. The computational advantage lies in the use of lapply(), which applies a function to all the members of a list in one command.
All the sub-datasets (reference and lab data) will be analyzed with the same pipeline.
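As a reminder of how lapply() behaves on a named list:

```r
# lapply applies the function to each element and returns a list of results
lapply(list(a = 1:3, b = 4:6), sum)
## $a
## [1] 6
##
## $b
## [1] 15
```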

allSO.list <- do.call(c, list(refSO.list, labSO.list))
# allSO.list <- append(refSO.list, labSO.list) # Equivalent allSO.list <- c(refSO.list, labSO.list) #
# Equivalent
SO.names <- names(allSO.list)
rm(refSO.list, labSO.list)
gc(verbose = FALSE)
str(allSO.list, max.level = 2)
##             used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells   6377269  340.6   11440542   611.0    8804052   470.2
## Vcells 990855682 7559.7 1475139588 11254.5 1339408159 10218.9
## List of 9
##  $ ref_day_04   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ ref_day_05   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ ref_day_06   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ ref_day_07   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_04   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_05   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_05bis:Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_06   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_11   :Formal class 'Seurat' [package "Seurat"] with 12 slots

Now, all the sub-dataset Seurat objects are stored in a list. The names of the list entries are the project.name of each sub-dataset.

As the number of cells may decrease during the following steps, I will print a table with the number of cells as well as the number of detected features in each sub-dataset. A barplot will also be displayed to follow the evolution of each sub-dataset along the analysis.

Size of all the sub-datasets:

size_suball <- data.frame(cbind(sapply(allSO.list, function(SO) {
    if (str_detect(SO@project.name, "^ref")) {
        "ref"
    } else {
        "lab"
    }
}), sapply(allSO.list, function(SO) {
    if (str_detect(SO@project.name, "04")) {
        "4"
    } else if (str_detect(SO@project.name, "05bis")) {
        "5bis"
    } else if (str_detect(SO@project.name, "05")) {
        "5"
    } else if (str_detect(SO@project.name, "06")) {
        "6"
    } else if (str_detect(SO@project.name, "07")) {
        "7"
    } else if (str_detect(SO@project.name, "11")) {
        "11"
    }
}), t(sapply(allSO.list, dim)), "Preliminary Counts", "01"))
colnames(size_suball) <- c("origin", "day", "Nbr_of_features", "Nbr_of_cells", "Analysis_step_name", "Step_number")
size_suball$Nbr_of_cells <- as.integer(as.character(size_suball$Nbr_of_cells))
size_suball$Nbr_of_features <- as.integer(as.character(size_suball$Nbr_of_features))
size_suball$day <- factor(size_suball$day, levels = c("4", "5", "5bis", "6", "7", "11"), ordered = TRUE)
knitr::kable(size_suball, align = "lrrrrlr")
origin day Nbr_of_features Nbr_of_cells Analysis_step_name Step_number
ref_day_04 ref 4 23961 6898 Preliminary Counts 01
ref_day_05 ref 5 23961 7783 Preliminary Counts 01
ref_day_06 ref 6 23961 8313 Preliminary Counts 01
ref_day_07 ref 7 23961 7502 Preliminary Counts 01
lab_day_04 lab 4 32286 7324 Preliminary Counts 01
lab_day_05 lab 5 32286 7794 Preliminary Counts 01
lab_day_05bis lab 5bis 32286 7729 Preliminary Counts 01
lab_day_06 lab 6 32286 6628 Preliminary Counts 01
lab_day_11 lab 11 32287 9584 Preliminary Counts 01
ggplot(size_suball, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = origin, fill = origin), position = position_dodge(), 
    width = 0.7) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + ggtitle("Number of cells in each sub-dataset") + 
    facet_grid(. ~ origin)

6 Quality control

The cells can be filtered out based on multiple criteria, like:

  • the number of features detected in a cell transcriptome,

  • the number of counts for each single cell or,

  • the percentage of mitochondrial counts.

Thus, the feature plots of the nCount_RNA and percent.mt metadata will guide me in choosing the thresholds for filtering out low-quality cells.

6.1 Percentage of mitochondrial reads & cutoffs

To determine the mitochondrial percentage in each cell, I rely on the fact that mouse mitochondrial gene names always begin with “mt-”. The PercentageFeatureSet function calculates, for each cell, the percentage of counts belonging to features matching a given pattern; here, I use the “^mt-” pattern.
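For reference, percent.mt is equivalent to the following manual computation (a sketch on a single Seurat object SO; slot access follows what is used elsewhere in this document):

```r
# Fraction of counts assigned to mitochondrial genes, per cell, in percent
counts <- GetAssayData(SO, slot = "counts")
mt.genes <- grep("^mt-", rownames(counts), value = TRUE)
mt.pct <- Matrix::colSums(counts[mt.genes, ]) / Matrix::colSums(counts) * 100
```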

allSO.list <- future_lapply(allSO.list, function(SO) {
    SO[["percent.mt"]] <- PercentageFeatureSet(SO, pattern = "^mt-")
    return(SO)
}, future.seed = 26)

The percentage of mitochondrial counts (or reads) is stored for each cell as a metadata under the name of percent.mt (see the example on the data of the lab, day 4).

orig.ident nCount_RNA nFeature_RNA day dataset replicate percent.mt
lab_day_04_0_AAACCCAAGTACAGAT lab_day_04 59952 6798 Day_04 lab_day_04 0 2.340206
lab_day_04_0_AAACCCAAGTTCATCG lab_day_04 38805 5932 Day_04 lab_day_04 0 1.154490
lab_day_04_0_AAACCCACAAGCGAAC lab_day_04 45520 6008 Day_04 lab_day_04 0 1.001758
lab_day_04_0_AAACCCACAAGGTTGG lab_day_04 26327 4497 Day_04 lab_day_04 0 1.762449
lab_day_04_0_AAACCCACACATATGC lab_day_04 28763 4954 Day_04 lab_day_04 0 2.718771
lab_day_04_0_AAACCCACACTAAACC lab_day_04 27481 5004 Day_04 lab_day_04 0 3.114879
lapply(allSO.list, function(SO) {
    VlnPlot(SO, features = c("percent.mt"), ncol = 3) + geom_hline(aes(yintercept = 10), color = "blue", size = 1.5) + 
        geom_hline(aes(yintercept = 1.5), color = "orange", size = 1.5)
})

The reference sub-datasets show a mitochondrial content that never exceeds 15% of the transcripts per cell. In addition, their cells never have less than 1.5% mitochondrial content, which is why there is no orange line on the first 4 plots (reference sub-datasets). One can suggest that these data were already filtered. In contrast, some cells in the lab data have a mitochondrial content close to 80%.

I will keep the cells whose percentage is between 1.5% and 10% (orange and blue lines, resp.).

Information coming from 10X Genomics

Apoptotic cells express mitochondrial genes and export these transcripts to the cytoplasm in mammalian cells. Lysed cells with intact mitochondria may be partitioned into GEMs, increasing the fraction of mitochondrial transcripts detected.

It is also mentioned in Luecken MD and Theis FJ Mol Syst Biol. 2019 that a high mitochondrial content may represent doublets.

6.2 Counts’ thresholds

Rossi et al. filtered out cells with fewer than 2000 read counts; here I apply a slightly stricter minimum of 3000 read counts (green line). In addition, I will also remove the cells with more than 150,000 read counts (purple line).

lapply(allSO.list, function(SO) {
    VlnPlot(SO, features = c("nCount_RNA"), ncol = 3) + geom_hline(aes(yintercept = 150000), color = "purple", 
        size = 1.5) + geom_hline(aes(yintercept = 3000), color = "green", size = 1.5)
})

The reference sub-datasets never have fewer than 3000 read counts, which is why there is no green line on the first 4 plots (reference sub-datasets). Actually, it appears that they never have fewer than 5000 read counts.

Furthermore, the upper threshold filters out some outliers in both the reference and laboratory sub-datasets. The 2 laboratory datasets from the day-5 sequencing do not reach such high read counts.

6.3 Minimum of feature’s diversity

The number of features per cell is also assessed for quality control.

lapply(allSO.list, function(SO) {
    VlnPlot(SO, features = c("nFeature_RNA"), ncol = 3) + geom_hline(aes(yintercept = 1800), color = "#56B4E9", 
        size = 1.5)
})

The reference sub-datasets never have fewer than 2000 features per cell, while the lab sub-datasets range from 38 to 10064 features per cell. Many cells of the lab sub-datasets show a low number of detected features. Thus, a minimum of 1800 detected features per cell will be required.

# lapply(allSO.list, function(SO) { print(names(SO@meta.data)) FeatureScatter(object=SO,
# feature1='nCount_RNA', feature2='nFeature_RNA', pt.size=0.8, cols='percent.mt') + geom_point(data =
# data.frame(SO@meta.data$nCount_RNA, SO@meta.data$nFeature_RNA, SO@meta.data$percent.mt), aes(x =
# 'nCount_RNA', y = 'nFeature_RNA'), color='percent.mt') + geom_hline(aes(yintercept = 1500),
# color='#56B4E9', size=1) + geom_vline(aes(xintercept = 150000), color='purple', size=1) +
# geom_vline(aes(xintercept = 3000), color='green', size=1) + ggtitle(paste0('Nb of features as function
# of Nb of counts\n', SO@project.name)) + scale_color_continuous(type = 'viridis') })

lapply(allSO.list, function(SO) {
    ggplot(SO@meta.data, aes(nCount_RNA, nFeature_RNA, colour = percent.mt)) + geom_point() + lims(colour = c(0, 
        100)) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + ylab("nFeature_RNA") + geom_hline(aes(yintercept = 1800), 
        color = "#56B4E9", size = 1) + geom_vline(aes(xintercept = 150000), color = "purple", size = 1) + 
        geom_vline(aes(xintercept = 3000), color = "green", size = 1) + ggtitle(paste0("Nb of features as function of Nb of counts\n", 
        SO@project.name))
})

Considering all the quality control criteria together, I confirm the previously set thresholds. The low-quality cells will be removed. The cells with a high mitochondrial content lie under the blue line (minimum of 1800 features) and to the left of the green line (minimum of 3000 read counts). Such cells, with a low count depth (counts per cell), few detected features and a high mitochondrial content, are indicative of cells whose cytoplasmic mRNA has leaked out through a broken membrane, so that only the mRNA located in the mitochondria is conserved (Luecken MD and Theis FJ, Mol Syst Biol, 2019).

6.4 Isolated features

Among the thousands of detected features, some are expressed in only a few cells. Such features do not provide information on cell heterogeneity. It is also worth noting that filtering out features expressed in fewer than 50 cells (red dashed line) would make it difficult to detect clusters gathering fewer than 50 cells.

# Create dataframe
countcells.df.list <- lapply(allSO.list, function(SO) {
    df <- data.frame(rowSums(SO@assays$RNA@counts != 0))
    df$features <- rownames(df)
    colnames(df) <- c("Nbr_of_cells", "features")
    rownames(df) <- NULL
    return(df)
})

lapply(1:length(countcells.df.list), function(i) {
    project.name <- names(countcells.df.list)[i]
    df <- countcells.df.list[[i]]
    # Plot gene expression repartition
    ggplot(df, aes(x = Nbr_of_cells)) + geom_histogram() + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + 
        ylab("frequency") + scale_x_continuous(trans = "log10") + expand_limits(x = c(0, 10500), y = c(0, 
        2000)) + geom_vline(aes(xintercept = 10), color = "green", size = 1) + geom_vline(aes(xintercept = 50), 
        color = "red", size = 1, linetype = "dashed") + ggtitle(paste0("Repartition of expressed features\n", 
        project.name))
})

In order to limit the number of dropouts while still being able to identify small clusters, the features expressed in fewer than 10 cells will be removed.

6.5 Apply QC

In this section, I apply all the cutoffs determined in the previous steps. As a result, the sub-datasets will have both fewer features and fewer cells.

allSO.list <- lapply(1:length(allSO.list), function(i) {
    df <- countcells.df.list[[i]]
    SO <- allSO.list[[i]]
    SO <- subset(SO, features = which(df$Nbr_of_cells > 10), subset = percent.mt > 1.5 & percent.mt < 10 & 
        nCount_RNA > 3000 & nCount_RNA < 150000 & nFeature_RNA > 1800)
    return(SO)
})
names(allSO.list) <- c(ref_dataset_names, lab_dataset_names)

Size of all the sub-datasets after filtering:

size_suball2 <- data.frame(cbind(
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
    }),
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "04")) { "4" }
        else if (str_detect(SO@project.name, "05bis")) { "5bis" }
        else if (str_detect(SO@project.name, "05")) { "5" }
        else if (str_detect(SO@project.name, "06")) { "6" }
        else if (str_detect(SO@project.name, "07")) { "7" }
        else if (str_detect(SO@project.name, "11")) { "11" }
    }),
    
    t(sapply(allSO.list, dim)),
    "QC_filtering",
    "02"
))
colnames(size_suball2) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball2$Nbr_of_cells <- as.integer(as.character(size_suball2$Nbr_of_cells))
size_suball2$Nbr_of_features <- as.integer(as.character(size_suball2$Nbr_of_features))
size_suball2$day <- factor(size_suball2$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball, size_suball2)
knitr::kable(size_suball_track) %>%
  scroll_box(width = "100%", height = "200px")
rm(size_suball, size_suball2)
origin day Nbr_of_features Nbr_of_cells Analysis_step_name Step_number
ref_day_04 ref 4 23961 6898 Preliminary Counts 01
ref_day_05 ref 5 23961 7783 Preliminary Counts 01
ref_day_06 ref 6 23961 8313 Preliminary Counts 01
ref_day_07 ref 7 23961 7502 Preliminary Counts 01
lab_day_04 lab 4 32286 7324 Preliminary Counts 01
lab_day_05 lab 5 32286 7794 Preliminary Counts 01
lab_day_05bis lab 5bis 32286 7729 Preliminary Counts 01
lab_day_06 lab 6 32286 6628 Preliminary Counts 01
lab_day_11 lab 11 32287 9584 Preliminary Counts 01
ref_day_041 ref 4 17599 6726 QC_filtering 02
ref_day_051 ref 5 17973 7637 QC_filtering 02
ref_day_061 ref 6 17918 8133 QC_filtering 02
ref_day_071 ref 7 17951 7385 QC_filtering 02
lab_day_041 lab 4 17268 5518 QC_filtering 02
lab_day_051 lab 5 18368 6728 QC_filtering 02
lab_day_05bis1 lab 5bis 18184 6689 QC_filtering 02
lab_day_061 lab 6 17547 5196 QC_filtering 02
lab_day_111 lab 11 17952 6398 QC_filtering 02
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name), 
    position = position_dodge(), width = 0.6) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + 
    ggtitle("Number of cells in each sub-dataset\nafter QC") + facet_grid(. ~ origin)

Note that the reference sub-datasets lose only a small fraction of their cells during QC (e.g. ref_day_04: 6898 → 6726), whereas the lab sub-datasets lose up to a third (e.g. lab_day_11: 9584 → 6398). This supports the earlier observation that the Rossi et al. data were already preprocessed.

7 Seurat Standard Preprocessing Workflow

To compare gene expression across multiple cells, the data have to be normalized. I use the default parameters of the NormalizeData function, meaning that the method used is LogNormalize. With this method, the feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor (10,000 by default). The result is then natural-log transformed using log1p.

allSO.list <- future_lapply(allSO.list, NormalizeData, verbose = FALSE, future.seed = 26)

The normalization is computed from the raw counts stored in the slot named counts. The result is then stored in another slot named data.

knitr::kable(data.frame(GetAssayData(allSO.list$ref_day_04, slot = "counts")[1:5, 1:5]), caption = "Raw counts in 'counts' slot") %>%
    scroll_box(width = "800px", height = "200px")
knitr::kable(data.frame(GetAssayData(allSO.list$ref_day_04, slot = "data")[1:5, 1:5]), caption = "Normalized counts in 'data' slot") %>%
    scroll_box(width = "800px", height = "200px")
Raw counts in ‘counts’ slot
ref_day_04_0_AAACCCATCGACTCCT ref_day_04_0_AAACGAAGTCATAACC ref_day_04_0_AAACGCTCAAGCGAGT ref_day_04_0_AAACGCTCACCCTAAA ref_day_04_0_AAAGAACAGCCTTGAT
Xkr4 0 0 0 0 0
Rp1 0 0 0 0 0
Sox17 5 0 0 0 0
Mrpl15 9 5 6 16 9
Lypla1 1 0 1 5 0
Normalized counts in ‘data’ slot
ref_day_04_0_AAACCCATCGACTCCT ref_day_04_0_AAACGAAGTCATAACC ref_day_04_0_AAACGCTCAAGCGAGT ref_day_04_0_AAACGCTCACCCTAAA ref_day_04_0_AAAGAACAGCCTTGAT
Xkr4 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Rp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Sox17 0.6628109 0.0000000 0.0000000 0.0000000 0.0000000
Mrpl15 0.9904438 0.7849221 0.8650819 1.1922741 0.8718615
Lypla1 0.1723114 0.0000000 0.2063636 0.5406086 0.0000000
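As a sanity check, the normalized Sox17 value can be recomputed by hand from the formula above, using the raw count of 5 and that cell's total counts (about 53183 before QC; the post-QC total is slightly lower, which explains the small difference in the last decimals):

```r
# LogNormalize by hand: count / total counts * scale.factor, then log1p
log1p(5 / 53183 * 10000)
## returns approximately 0.6628, matching the value stored in the 'data' slot
```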

I continue with the standard preprocessing steps, using default parameters:

  • highly variable gene (HVG) detection, with the FindVariableFeatures function, where the 2000 features having the most cell-to-cell variation are identified,

  • scale the data, via a linear transformation, using the ScaleData function, on all features (more details below),

  • perform linear reduction dimension with Principal Component Analysis (PCA) to identify the sources of heterogeneity in every sub-dataset, through 50 principal components (PCs),

  • plot an ElbowPlot to visualize the most informative PCs.

allSO.list <- future_lapply(allSO.list, function(SO) {
    SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
    SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
    SO <- RunPCA(SO, verbose = FALSE)
    return(SO)
}, future.seed = 26)

The ScaleData function performs a linear transformation (so-called scaling). In detail, it shifts the expression of each gene so that the mean across cells is 0 (negative values arise), and scales it so that the variance equals 1. Concretely, it standardizes the expression of each gene. This step gives equal weight to genes in downstream analyses, so that highly expressed genes do not dominate (<https://satijalab.org/seurat/articles/pbmc3k_tutorial.html>).
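The standardization applied to each gene (row) across cells can be sketched with illustrative values (not taken from the dataset):

```r
# Shift one gene's normalized expression to mean 0 and scale to variance 1
x <- c(0.17, 0.00, 0.21, 0.54, 0.00)
scaled <- (x - mean(x)) / sd(x)
round(c(mean(scaled), sd(scaled)), 10)
## [1] 0 1
```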

From the scaling step, a new slot named scale.data has been created. Below is an example of the same features in the same cells as above.

knitr::kable(data.frame(GetAssayData(allSO.list$ref_day_04, slot = "scale.data")[1:5, 1:5]), caption = "Scaled data in 'scale.data' slot") %>%
    scroll_box(width = "800px", height = "200px")
Scaled data in ‘scale.data’ slot
ref_day_04_0_AAACCCATCGACTCCT ref_day_04_0_AAACGAAGTCATAACC ref_day_04_0_AAACGCTCAAGCGAGT ref_day_04_0_AAACGCTCACCCTAAA ref_day_04_0_AAAGAACAGCCTTGAT
Xkr4 -0.2033840 -0.2033840 -0.2033840 -0.2033840 -0.2033840
Rp1 -0.0341485 -0.0341485 -0.0341485 -0.0341485 -0.0341485
Sox17 1.7263643 -0.2826478 -0.2826478 -0.2826478 -0.2826478
Mrpl15 -0.0977819 -0.7710953 -0.5084824 0.5634377 -0.4862717
Lypla1 -0.3483848 -1.1274818 -0.1944192 1.3168522 -1.1274818

The PCA was performed with the identified HVG as input. The Elbow plots below show, for each sub-dataset, how informative each PC is, in terms of the percentage of variance explained.

In the analysis realized by Rossi et al., the PCA computed 30 PCs only, and all the downstream analyses considered all of these PCs. Here, I use the default parameters of the RunPCA function, which computes 50 PCs.

lapply(allSO.list, function(SO) {
    ElbowPlot(SO, ndims = 50) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept = 30), 
        color = "purple", size = 1) + ggtitle(paste0("ElbowPlot\n", SO@project.name))
})

Taking into account the advice from the Seurat - Guided Clustering Tutorial vignette, the downstream analysis will be done on the first 30 PCs for all the sub-datasets.

Then, I go through the community detection among the cells. The edges are constructed using a k-nearest neighbors (KNN) graph, based on the Euclidean distances previously obtained in the PCA space. The weights of the edges are based on the shared overlap of the cells' local neighborhoods, a method called Jaccard similarity. These 2 steps are done with the FindNeighbors function (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).

Once I obtain a graph on the data, I run the FindClusters function. It applies the Louvain algorithm (by default). This algorithm optimizes the modularity of the graph.

  • Modularity assesses the quality of a division of a network (or graph) into communities. It also depends on a resolution parameter: the higher the resolution, the higher the number of clusters.
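To illustrate the Jaccard-based edge weighting mentioned above, here is a toy example (not the FindNeighbors internals):

```r
# Toy neighbour sets of two cells in the KNN graph (made-up cell indices)
neighbours_A <- c(1, 2, 3, 4, 5)
neighbours_B <- c(3, 4, 5, 6, 7)

# Jaccard similarity: shared neighbours over the union of neighbours
jaccard <- length(intersect(neighbours_A, neighbours_B)) /
  length(union(neighbours_A, neighbours_B))
jaccard  # 3 shared / 7 in the union
```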

Originally, the analysis realized by Rossi et al. set the resolution parameter at 4.6. Here, I test the Louvain algorithm according to multiple levels of resolution (0.1, 0.5, 1, 2 and 5).

res <- c(0.1, 0.5, 1, 2, 5)
allSO.list <- future_lapply(allSO.list, function(SO) {
    SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
    SO <- FindClusters(SO, resolution = res, verbose = FALSE)
    return(SO)
}, future.seed = 26)

The relationship between the resolution levels and the number of clusters is presented below.

data <- c()
resolution <- c()
nb_clusters <- c()
for (SO in allSO.list) {
    for (ares in res) {
        nb_levels <- nlevels(SO@meta.data[[paste0("RNA_snn_res.", ares)]])

        data <- append(data, SO@project.name)
        resolution <- append(resolution, ares)
        nb_clusters <- append(nb_clusters, nb_levels)
    }
}

df <- data.frame(data, resolution, nb_clusters)
knitr::kable(pivot_wider(df, names_from = "data", values_from = "nb_clusters"), caption = "Number of clusters in sub-datasets\naccording to the resolution level")
Number of clusters in sub-datasets according to the resolution level
resolution ref_day_04 ref_day_05 ref_day_06 ref_day_07 lab_day_04 lab_day_05 lab_day_05bis lab_day_06 lab_day_11
0.1 5 5 6 7 3 4 4 5 10
0.5 10 10 11 14 10 8 8 14 15
1.0 14 14 17 18 13 12 14 18 22
2.0 24 25 24 26 23 22 21 30 32
5.0 42 44 43 47 47 44 46 48 48

By running a non-linear dimension reduction such as the Uniform Manifold Approximation and Projection (UMAP), the clusters can be visualized.

allSO.list <- lapply(allSO.list, RunUMAP, dims = 1:30, verbose = FALSE)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
for (ares in res) {
    print(DimPlot(allSO.list$lab_day_11, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") + 
        ggtitle(paste0("Resolution level: ", ares, "\n", allSO.list$lab_day_11@project.name)) + theme(plot.title = element_text(hjust = 0.5)))
}
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.

For the lab sub-dataset day_11, the best value for the resolution parameter appears to be 1. I set the Idents of all sub-datasets to the clusters defined at resolution level 1.

allSO.list <- lapply(allSO.list, function(SO) {
    Idents(SO) <- SO$RNA_snn_res.1
    print(DimPlot(SO, label = TRUE, reduction = "umap") + ggtitle(paste0("Resolution level: 1\n", SO@project.name)) + 
        theme(plot.title = element_text(hjust = 0.5)))
    return(SO)
})
gc()

8 Remove doublets

## pK Identification (no ground-truth) ---------------------------------------------------------------------------------------
pK.list <- sapply(allSO.list,
                  function(SO){
                    sweep.res.list_SO <- paramSweep_v3(SO, PCs = 1:30) # as estimated from the PC ElbowPlot
                    sweep.stats_SO <- summarizeSweep(sweep.res.list_SO, GT = FALSE)
                    bcmvn_SO <- find.pK(sweep.stats_SO)
                    
                    # print() is needed for the plot to be displayed from inside a function
                    print(ggplot(bcmvn_SO, aes(pK, BCmetric, group = 1)) +
                            geom_point() +
                            geom_line())
                    
                    pK <- bcmvn_SO %>% # select the pK that maximizes BCmvn, to optimize doublet detection
                      filter(BCmetric == max(BCmetric)) %>%
                      select(pK) 
                    pK <- as.numeric(as.character(pK[[1]]))
                    return(pK)
                  })


## Load the Pijuan-Sala classifier provided by Rossi et al. ------------------------------------------------------------------
scmap_classifier <- readRDS(file=paste0(classifier.folder, "scmap_classifier_1000markers.rds"))
ref_stages <- c("E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5")

## Annotate cells from pijuansala classifier ---------------------------------------------------------------------------------
allSO.list <- lapply(allSO.list,
                     function(SO){
                       set.seed(1234567)
                       
                       # create the coresponding SingleCellExperiment object
                       sce <- as.SingleCellExperiment(x=SO)
                       rowData(sce)$feature_symbol <- rownames(sce)
                       counts(sce) <- as.matrix(counts(sce))
                       logcounts(sce) <- as.matrix(logcounts(sce))
                       
                       # apply scmapCluster
                       scmapCluster_results <- scmapCluster(projection=sce, index_list=scmap_classifier[ref_stages], threshold=0)
                       
                       # add the celltype from the in vivo atlas to the Seurat object
                       SO <- AddMetaData(object=SO, metadata=scmapCluster_results$combined_labs, col.name="celltype_DF")
                       
                       # save memory, do garbage collection
                       rm(sce)
                       gc()
                       
                       return(SO)
                     })

lapply(allSO.list,
       function(SO){
         Idents(SO) <- SO$celltype_DF
         DimPlot(SO, pt.size = 0.8, label = T, label.size = 6, repel = T,
                 cols = colors.use_transferred[levels(Idents(SO))]) +
           NoLegend() +
           ggtitle(SO@project.name) +
           theme(plot.title = element_text(hjust = 0.5))
       }
)

## Homotypic Doublet Proportion Estimate -------------------------------------------------------------------------------------
dblts_nonhomo <- sapply(allSO.list,
                        function(SO){
                          # based on annotation (celltype from pijuansala classifier)
                          # could be also the louvain clusters determined from the preprocessus step
                          annotations <- SO@meta.data$celltype_DF
                          homotypic.prop <- modelHomotypic(annotations)
                          nDoublets <- round(ncol(SO)*7.6/100)
                          nDoublets_nonhomo <- round(nDoublets*(1-homotypic.prop))

              return(nDoublets_nonhomo)
                        })
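The nExp computation above can be illustrated with hypothetical numbers: for a sub-dataset of 10,000 cells, the assumed doublet formation rate of 7.6% and a homotypic proportion of 20% (in practice estimated by modelHomotypic) give:

```r
n_cells <- 10000            # hypothetical sub-dataset size
doublet_rate <- 7.6 / 100   # assumed doublet formation rate, as in the code above
homotypic.prop <- 0.20      # hypothetical value; modelHomotypic() estimates it
nDoublets <- round(n_cells * doublet_rate)                     # 760 expected doublets
nDoublets_nonhomo <- round(nDoublets * (1 - homotypic.prop))   # 608 detectable (non-homotypic)
```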


# run doubletFinder ---------------------------------------------------------------------------------------------------------- 
allSO.list <- lapply(allSO.list,
                     function(SO){
                       SO <- doubletFinder_v3(SO, 
                                              PCs = 1:30, 
                                              pN = 0.25, 
                                              pK = pK.list[[SO@project.name]], 
                                              nExp = dblts_nonhomo[[SO@project.name]])
                     })


# remove the cells tagged as doublet -----------------------------------------------------------------------------------------
allSO.list <- lapply(allSO.list,
                     function(SO){
                       # remove the cells identified as doublets of each of the sub-datasets
                       col_dblts <- grep("DF.classifications", colnames(SO@meta.data), value=TRUE)
                       Idents(SO) <- col_dblts
                       SO <- subset(SO, idents='Singlet')
                       
                       # remove useless columns
                       SO@meta.data[[grep("DF.classifications", colnames(SO@meta.data))]] <- NULL
                       names(SO@meta.data)[names(SO@meta.data) == grep("pANN",colnames(SO@meta.data),value=T)] <- "pANN"
                       
                       return(SO)
                     })


## Check size of the subdatasets ---------------------------------------------------------------------------------------------
size_suball3 <- data.frame(cbind(
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
    }),
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "04")) { "4" }
        else if (str_detect(SO@project.name, "05bis")) { "5bis" }
        else if (str_detect(SO@project.name, "05")) { "5" }
        else if (str_detect(SO@project.name, "06")) { "6" }
        else if (str_detect(SO@project.name, "07")) { "7" }
        else if (str_detect(SO@project.name, "11")) { "11" }
    }),
    
    t(sapply(allSO.list, dim)),
    "Removed Doublets",
    "03"
))
colnames(size_suball3) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball3$Nbr_of_cells <- as.integer(as.character(size_suball3$Nbr_of_cells))
size_suball3$Nbr_of_features <- as.integer(as.character(size_suball3$Nbr_of_features))
size_suball3$day <- factor(size_suball3$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball_track, size_suball3)
knitr::kable(size_suball_track)
rm(size_suball3)

ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) +
    geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name), position = position_dodge(), width = 0.6) +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Number of cells in each sub-dataset\nafter doublets removal") +
    facet_grid(.~ origin)


## back to the preprocessing analysis: recluster the cells after removing the doublets ----------------------------------------
res <- seq(0.8, 1.6, 0.2)
allSO.list <- lapply(allSO.list,
                     function(SO){
                       SO <- NormalizeData(SO, verbose=FALSE)
                       SO <- FindVariableFeatures(SO, nfeatures=2000, verbose=FALSE)
                       SO <- ScaleData(SO, features=rownames(SO), verbose=FALSE)
                       SO <- RunPCA(SO, verbose=FALSE)
                       # print() is needed for the plot to be displayed from inside a function
                       print(ElbowPlot(SO, ndims = 50) +
                               theme_minimal() +
                               theme(plot.title = element_text(hjust = 0.5)) +
                               geom_vline(aes(xintercept = 30), color="purple", size=1) +
                               ggtitle(paste0("ElbowPlot\n", SO@project.name)))
                       SO <- FindNeighbors(SO, dims = 1:30, verbose=FALSE)
                       SO <- FindClusters(SO, resolution = res, verbose=FALSE)
                       SO <- RunUMAP(SO, dims = 1:30)
                       for (ares in res){
                         print(DimPlot(SO, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") +
                                 ggtitle(paste0("Resolution level: ", ares, "\n", SO@project.name)) +
                                 theme(plot.title = element_text(hjust=0.5)))
                       }
                       return(SO)
                     })

9 Sources of unwanted variation

There are several sources of unwanted variation in a single-cell RNA sequencing analysis. Unwanted variation might come from a batch effect, but also from the cell-cycle state of the cells, or even from the mitochondrial and ribosomal expression levels. I will check the batch effect on the reference data only, given that I have only one batch for each sub-dataset of the lab. Then, I will also check the cell-cycle effect on all the sub-datasets.
Regarding the mitochondrial expression level, I will not study it, given that I already applied thresholds on this variable.

9.1 Batch effect

As already mentioned, from the lab we have only one replicate for the sequencing data of day 04, day 06 and day 11. Day 05 has been resequenced. Looking at the two datasets (lab_day_05 and lab_day_05bis), they are similar enough to be considered as two batches of a technical replicate.

9.1.1 Lab data on day 5

# Merge the two lab day-5 sub-datasets
lab_datasets <- c("lab_day_05", "lab_day_05bis")
mergeDay5 <- merge(x = allSO.list$lab_day_05, y = allSO.list$lab_day_05bis, project = "lab_day_05", merge.data = T)
mergeDay5@meta.data[["dataset"]] <- "lab_day_05"
Idents(mergeDay5) <- mergeDay5$dataset
mergeDay5 <- NormalizeData(object = mergeDay5, normalization.method = "LogNormalize", scale.factor = 10000, 
    verbose = FALSE)
mergeDay5 <- FindVariableFeatures(object = mergeDay5, nfeatures = 1000, verbose = FALSE)
mergeDay5 <- ScaleData(object = mergeDay5, verbose = FALSE)
mergeDay5 <- RunPCA(mergeDay5, npcs = 30, verbose = FALSE)

# function to do DimPlot entitled with sub-dataset name and if PCA is done before or after the regression
batch_DimPlot <- function(SO, when = NA) {

    Idents(SO) <- SO$replicate
    DimPlot(SO) + ggtitle(paste0("Replicates of ", SO@project.name, "\nin the 2 first PCs ", when, " regression"))
}

batch_DimPlot(mergeDay5, when = "before")

The two batches share a similar distribution across the space of the first two PCs.

Below, the batch correction is realized. This step allows us to determine whether the batch correction is required or not.

correctedDay5 <- ScaleData(mergeDay5, vars.to.regress = "replicate", verbose = FALSE)
correctedDay5 <- RunPCA(correctedDay5)
batch_DimPlot(correctedDay5, when = "after")

The data after batch correction do not provide more information regarding the first two PCs. The batch correction will not be kept for the rest of the analysis. The replicates of day 05 will simply be merged.

allSO.list$lab_day_05 <- mergeDay5
allSO.list$lab_day_05bis <- NULL
SO.names <- names(allSO.list)
rm(mergeDay5, correctedDay5)
gc()

9.1.2 Reference data

Only the reference data will be tested for a batch effect regression.

# Get only reference sub-datasets
ref_datasets <- c("ref_day_04", "ref_day_05", "ref_day_06", "ref_day_07")
ref_SO.list <- allSO.list[ref_datasets]

# function to do DimPlot entitled with sub-dataset name and if PCA is done before or after the regression
batch_DimPlot <- function(SO, when = NA) {

    Idents(SO) <- SO$replicate
    DimPlot(SO) + ggtitle(paste0("Replicates of ", SO@project.name, "\nin the 2 first PCs ", when, " regression"))
}

future_lapply(ref_SO.list, batch_DimPlot, when = "before", future.seed = 26)

The variation throughout the data does not appear to be batch-driven. Let's see what occurs after regressing out the batch variable.

ref_SO.list <- future_lapply(ref_SO.list, ScaleData, vars.to.regress = "replicate", verbose = FALSE, future.seed = 26)
ref_SO.list <- future_lapply(ref_SO.list, function(x) RunPCA(x, features = VariableFeatures(x)), future.seed = 26)
future_lapply(ref_SO.list, batch_DimPlot, when = "after", future.seed = 26)

As observed previously, the variation in the reference data is not driven by the replicates.

In conclusion, there is no need to apply the regression on the sub-datasets.

9.2 Cell-cycle effect

It is known that the cell-cycle state of the cells influences the pattern of gene expression. This part is largely inspired by the cell_cycle_vignette provided by Seurat.

9.2.1 Retrieve cell-cycle markers

After getting the list of the cell-cycle markers, I want to know if they have been detected in the sequencing.

# A list of cell-cycle markers, from Tirosh et al, 2015, is loaded with Seurat
s.genes <- str_to_title(cc.genes$s.genes)
g2m.genes <- str_to_title(cc.genes$g2m.genes)

# Number of markers for the S and G2M phases
sapply(list(S_phase = s.genes, G2M_phase = g2m.genes), length)

# Are cell-cycle markers in my data ?  check for the length of the intersection
sapply(allSO.list, function(x) length(intersect(rownames(x@assays$RNA), s.genes)))
sapply(allSO.list, function(x) length(intersect(rownames(x@assays$RNA), g2m.genes)))
##   S_phase G2M_phase 
##        43        54 
## ref_day_04 ref_day_05 ref_day_06 ref_day_07 lab_day_04 lab_day_05 lab_day_06 
##         42         42         42         42         42         42         42 
## lab_day_11 
##         42 
## ref_day_04 ref_day_05 ref_day_06 ref_day_07 lab_day_04 lab_day_05 lab_day_06 
##         52         52         52         52         52         52         52 
## lab_day_11 
##         52

Most of the markers have been sequenced in each of the sub-datasets.

9.2.2 Determine the cell-cycle phase of the cells

Now, I can assign cell-cycle scores: for each cell, a score based on the expression of the G2/M and S phase markers.

# Apply a cell cycle score
cc_SO.list <- future_lapply(allSO.list, CellCycleScoring, s.features = s.genes, g2m.features = g2m.genes, 
    verbose = FALSE, future.seed = 26)

9.2.3 Plots generation before cell-cycle regression

I create a generic function to plot the data in the first 2 PCs. Each plot's title indicates whether the PCA relies on the variable features or on the cell-cycle markers, and whether the PCA has been done before or after the cell-cycle regression.

run_DimPlot <- function(SO, pca = NA, when = NA) {
    Idents(SO) <- SO$Phase
    p <- DimPlot(SO) + ggtitle(paste0("PCA: ", pca, "\nRegression: ", when)) + theme(text = element_text(size = 10), 
        plot.title = element_text(face = "plain"))
    return(p)
}

First call to that function. The PCA on the variable features has already been done in the Pre-process Seurat objects part. Then, a new PCA is computed on the cell-cycle markers, followed by a new call to the previously described function.

beforeAll <- future_lapply(cc_SO.list, run_DimPlot, pca = "variable features", when = "before", future.seed = 26)
names(beforeAll) <- SO.names

cc_SO.list <- future_lapply(cc_SO.list, RunPCA, features = c(s.genes, g2m.genes), verbose = FALSE, future.seed = 26)

beforeCCgenes <- future_lapply(cc_SO.list, run_DimPlot, pca = "cell-cycle markers", when = "before", future.seed = 26)
names(beforeCCgenes) <- SO.names

The plots are stored in lists, under the names of the sub-datasets.

9.2.4 Run the cell-cycle regression

# Regression
cc_SO.list <- future_lapply(cc_SO.list, function(x) ScaleData(x, vars.to.regress = c("S.Score", "G2M.Score"), 
    verbose = FALSE), future.seed = 26)

9.2.5 Plots generation after cell-cycle regression

# Plots generation
cc_SO.list <- future_lapply(cc_SO.list, RunPCA, verbose = FALSE, future.seed = 26)
afterAll <- future_lapply(cc_SO.list, run_DimPlot, pca = "variable features", when = "after", future.seed = 26)
names(afterAll) <- SO.names

cc_SO.list <- future_lapply(cc_SO.list, RunPCA, features = c(s.genes, g2m.genes), verbose = FALSE, future.seed = 26)
afterCCgenes <- future_lapply(cc_SO.list, run_DimPlot, pca = "cell-cycle markers", when = "after", future.seed = 26)
names(afterCCgenes) <- SO.names

9.2.6 Plots display

displayPlot <- function(i) {
    p1 <- beforeCCgenes[[i]]
    p2 <- beforeAll[[i]]
    p3 <- afterCCgenes[[i]]
    p4 <- afterAll[[i]]

    grid.arrange(p1, p2, p3, p4, nrow = 2, top = textGrob(SO.names[i], hjust = 0.5, gp = gpar(fontsize = 18, 
        fontface = "bold")))

}

lapply(1:length(cc_SO.list), displayPlot)

As the PCA is done using the s.genes and g2m.genes lists, it is expected that the cells segregate according to their cell-cycle phase. Indeed, running a PCA on the cell-cycle markers reveals, unsurprisingly, that the cells separate entirely by phase.

10 Conclusion

In this part, I organized the datasets into sub-datasets according to their origin (reference or lab) and to the day of the experiment on which they were obtained. Then, I cleaned the sub-datasets by studying the mitochondrial expression in the cells and removing the outliers. I also removed the doublets. Finally, I tested whether or not it is necessary to regress the sub-datasets on the batch effect or the cell-cycle phase.

All these sub-datasets have to be saved, one RDS file per sub-dataset, to be used in the second part of the study.

future_lapply(allSO.list, function(x) saveRDS(x, file = paste0(rdsObjects, "01_dailySO_", x@project.name, 
    ".rds")), future.seed = 26)